Cuisines Dataset

A2
Author

Diya, Ashmita, Krithika, Arnav

Published

October 20, 2025

Allrecipes Cuisine Dataset

Exploring and Analysing patterns of an Allrecipes Cuisines Dataset

Data Dictionary

A data frame with 2218 rows and 17 variables:

Variable        Description
name            Name of the recipe
country         The country/region the cuisine is from
url             Link to the recipe
author          Author of the recipe
date_published  When the recipe was published/updated
ingredients     The ingredients of the recipe
calories        Calories per serving
fat             Fat per serving
carbs           Carbs per serving
protein         Protein per serving
avg_rating      Average rating out of 5 stars
total_ratings   Number of ratings received
reviews         Number of written reviews
prep_time       Preparation time in minutes
cook_time       Cooking time in minutes
total_time      Prep + cook time in minutes
servings        Number of servings

Setting up R Packages

library(ggformula)
Loading required package: ggplot2
Loading required package: scales
Loading required package: ggridges

New to ggformula?  Try the tutorials: 
    learnr::run_tutorial("introduction", package = "ggformula")
    learnr::run_tutorial("refining", package = "ggformula")
library(janitor)

Attaching package: 'janitor'
The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(mosaic)
Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2

The 'mosaic' package masks several functions from core packages in order to add 
additional features.  The original behavior of these functions should not be affected by this.

Attaching package: 'mosaic'
The following objects are masked from 'package:dplyr':

    count, do, tally
The following object is masked from 'package:Matrix':

    mean
The following object is masked from 'package:scales':

    rescale
The following object is masked from 'package:ggplot2':

    stat
The following objects are masked from 'package:stats':

    binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
    quantile, sd, t.test, var
The following objects are masked from 'package:base':

    max, mean, min, prod, range, sample, sum
library(naniar)
library(skimr)

Attaching package: 'skimr'
The following object is masked from 'package:naniar':

    n_complete
The following object is masked from 'package:mosaic':

    n_missing
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.4     ✔ tibble    3.3.0
✔ purrr     1.0.4     ✔ tidyr     1.3.1
✔ readr     2.1.5     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ mosaic::count()     masks dplyr::count()
✖ purrr::cross()      masks mosaic::cross()
✖ purrr::discard()    masks scales::discard()
✖ mosaic::do()        masks dplyr::do()
✖ tidyr::expand()     masks Matrix::expand()
✖ dplyr::filter()     masks stats::filter()
✖ dplyr::lag()        masks stats::lag()
✖ tidyr::pack()       masks Matrix::pack()
✖ mosaic::stat()      masks ggplot2::stat()
✖ mosaic::tally()     masks dplyr::tally()
✖ tidyr::unpack()     masks Matrix::unpack()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tinytable)

Attaching package: 'tinytable'

The following object is masked from 'package:ggplot2':

    theme_void
library(visdat)
library(crosstable)

Attaching package: 'crosstable'

The following object is masked from 'package:purrr':

    compact
library(RColorBrewer)
library(GGally)
library(ggplot2)
library(correlation)

Attaching package: 'correlation'

The following object is masked from 'package:mosaic':

    cor_test
library(dplyr)
library(DT)

Viewing the dataset

data("cuisines", package = "tastyR")
glimpse(cuisines)   
Rows: 2,218
Columns: 17
$ name           <chr> "Saganaki (Flaming Greek Cheese)", "Coney Island Knishe…
$ country        <chr> "Greek", "Jewish", "Australian and New Zealander", "Chi…
$ url            <chr> "https://www.allrecipes.com/recipe/263750/flaming-greek…
$ author         <chr> "John Mitzewich", "John Mitzewich", "CHIPPENDALE", "Hei…
$ date_published <date> 2024-02-07, 2024-11-26, 2022-07-14, 2025-01-31, 2025-0…
$ ingredients    <chr> "1 (4 ounce) package kasseri cheese, 1 tablespoon water…
$ calories       <dbl> 391, 301, 64, 106, 449, 958, 378, 90, 157, 322, 4, NA, …
$ fat            <dbl> 25, 17, 3, 9, 23, 24, 10, 5, 6, 16, 0, NA, 21, 2, 66, 8…
$ carbs          <dbl> 15, 31, 9, 7, 58, 144, 59, 10, 25, 39, 1, NA, 16, 63, 7…
$ protein        <dbl> 16, 7, 1, 1, 7, 46, 14, 1, 2, 7, 0, NA, 28, 6, 54, 17, …
$ avg_rating     <dbl> 4.8, 4.6, 4.3, 5.0, 3.8, 4.4, 4.3, NA, 4.6, 5.0, 4.7, 4…
$ total_ratings  <dbl> 25, 10, 126, 1, 13, 40, 3, NA, 65, 2, 182, 2, 19, 16, 9…
$ reviews        <dbl> 22, 9, 104, 1, 11, 32, 3, NA, 55, 2, 138, 2, 15, 16, 84…
$ prep_time      <dbl> 10, 30, 20, 10, 30, 30, 30, 40, 0, 5, 5, 5, 10, 10, 20,…
$ cook_time      <dbl> 5, 75, 15, 0, 15, 165, 75, 30, 0, 5, 0, 25, 10, 50, 16,…
$ total_time     <dbl> 15, 180, 180, 10, 45, 675, 585, 155, 0, 10, 5, 30, 50, …
$ servings       <dbl> 2, 16, 12, 6, 15, 6, 6, 84, 24, 1, 21, 8, 4, 10, 4, 8, …

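Replacing common NA representations (naniar's common_na_numbers and common_na_strings) with actual NA values in the dataset. Note that this is a blanket replacement: in the output below, a fat value of 66 from the original data (the 15th value shown) has become NA, so genuine measurements that happen to match a placeholder code are lost as well.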

cuisines_modified <- cuisines %>%
  naniar::replace_with_na_all(data = ., condition = ~ .x %in% common_na_numbers) %>%
  naniar::replace_with_na_all(data = ., condition = ~ .x %in% common_na_strings)
glimpse(cuisines_modified)
Rows: 2,218
Columns: 17
$ name           <chr> "Saganaki (Flaming Greek Cheese)", "Coney Island Knishe…
$ country        <chr> "Greek", "Jewish", "Australian and New Zealander", "Chi…
$ url            <chr> "https://www.allrecipes.com/recipe/263750/flaming-greek…
$ author         <chr> "John Mitzewich", "John Mitzewich", "CHIPPENDALE", "Hei…
$ date_published <date> 2024-02-07, 2024-11-26, 2022-07-14, 2025-01-31, 2025-0…
$ ingredients    <chr> "1 (4 ounce) package kasseri cheese, 1 tablespoon water…
$ calories       <dbl> 391, 301, 64, 106, 449, 958, 378, 90, 157, 322, 4, NA, …
$ fat            <dbl> 25, 17, 3, 9, 23, 24, 10, 5, 6, 16, 0, NA, 21, 2, NA, 8…
$ carbs          <dbl> 15, 31, 9, 7, 58, 144, 59, 10, 25, 39, 1, NA, 16, 63, 7…
$ protein        <dbl> 16, 7, 1, 1, 7, 46, 14, 1, 2, 7, 0, NA, 28, 6, 54, 17, …
$ avg_rating     <dbl> 4.8, 4.6, 4.3, 5.0, 3.8, 4.4, 4.3, NA, 4.6, 5.0, 4.7, 4…
$ total_ratings  <dbl> 25, 10, 126, 1, 13, 40, 3, NA, 65, 2, 182, 2, 19, 16, 9…
$ reviews        <dbl> 22, 9, 104, 1, 11, 32, 3, NA, 55, 2, 138, 2, 15, 16, 84…
$ prep_time      <dbl> 10, 30, 20, 10, 30, 30, 30, 40, 0, 5, 5, 5, 10, 10, 20,…
$ cook_time      <dbl> 5, 75, 15, 0, 15, 165, 75, 30, 0, 5, 0, 25, 10, 50, 16,…
$ total_time     <dbl> 15, 180, 180, 10, 45, 675, 585, 155, 0, 10, 5, 30, 50, …
$ servings       <dbl> 2, 16, 12, 6, 15, 6, 6, 84, 24, 1, 21, 8, 4, 10, 4, 8, …
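
The output above shows one side effect of the blanket replacement: the fat value 66 from the original data is now NA. A minimal sketch of why this happens, and of a narrower alternative, follows. It assumes that naniar's `common_na_numbers` contains 66 (the vector is written out by hand here as `sentinels` and should be checked against the installed version). The `toy` data frame and the `"N/A"` placeholder are made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical): 66 is a genuine fat value, "N/A" is a placeholder string
toy <- data.frame(
  author = c("A", "N/A", "C"),
  fat    = c(25, 66, NA),
  stringsAsFactors = FALSE
)

# Assumed to mirror naniar::common_na_numbers; verify against your installed version
sentinels <- c(-9, -99, -999, -9999, 9999, 66, 77, 88)

# A blanket numeric replacement would flag the genuine 66 as missing
toy$fat %in% sentinels

# Narrower alternative: only recode placeholder strings in character columns
toy_clean <- toy %>%
  mutate(across(where(is.character), ~ na_if(.x, "N/A")))

toy_clean
```

Restricting the replacement to character columns keeps every numeric measurement intact. Whether any numeric placeholder codes actually occur in this dataset would need to be checked separately.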

Viewing Missing Data

visdat::vis_miss(cuisines_modified)

visdat::vis_dat(cuisines_modified)

Removing Missing Data

cuisines_modified_new <- cuisines_modified %>% drop_na()
glimpse(cuisines_modified_new)
Rows: 1,989
Columns: 17
$ name           <chr> "Saganaki (Flaming Greek Cheese)", "Coney Island Knishe…
$ country        <chr> "Greek", "Jewish", "Australian and New Zealander", "Chi…
$ url            <chr> "https://www.allrecipes.com/recipe/263750/flaming-greek…
$ author         <chr> "John Mitzewich", "John Mitzewich", "CHIPPENDALE", "Hei…
$ date_published <date> 2024-02-07, 2024-11-26, 2022-07-14, 2025-01-31, 2025-0…
$ ingredients    <chr> "1 (4 ounce) package kasseri cheese, 1 tablespoon water…
$ calories       <dbl> 391, 301, 64, 106, 449, 958, 378, 157, 322, 4, 389, 253…
$ fat            <dbl> 25, 17, 3, 9, 23, 24, 10, 6, 16, 0, 21, 2, 8, 31, 19, 7…
$ carbs          <dbl> 15, 31, 9, 7, 58, 144, 59, 25, 39, 1, 16, 63, 53, 43, 4…
$ protein        <dbl> 16, 7, 1, 1, 7, 46, 14, 2, 7, 0, 28, 6, 17, 25, 7, 29, …
$ avg_rating     <dbl> 4.8, 4.6, 4.3, 5.0, 3.8, 4.4, 4.3, 4.6, 5.0, 4.7, 4.4, …
$ total_ratings  <dbl> 25, 10, 126, 1, 13, 40, 3, 65, 2, 182, 19, 16, 20, 43, …
$ reviews        <dbl> 22, 9, 104, 1, 11, 32, 3, 55, 2, 138, 15, 16, 15, 39, 2…
$ prep_time      <dbl> 10, 30, 20, 10, 30, 30, 30, 0, 5, 5, 10, 10, 15, 60, 10…
$ cook_time      <dbl> 5, 75, 15, 0, 15, 165, 75, 0, 5, 0, 10, 50, 15, 10, 45,…
$ total_time     <dbl> 15, 180, 180, 10, 45, 675, 585, 0, 10, 5, 50, 300, 60, …
$ servings       <dbl> 2, 16, 12, 6, 15, 6, 6, 24, 1, 21, 4, 10, 8, 6, 10, 6, …
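
drop_na() removes every row that has an NA in any column, which here takes the data from 2,218 to 1,989 rows. Before dropping, it can help to see which columns drive that loss. A small sketch on made-up data (the `toy` data frame is hypothetical):

```r
library(tidyr)

# Toy data (hypothetical) with one NA in each of two columns
toy <- data.frame(
  calories   = c(391, NA, 64, 106),
  avg_rating = c(4.8, 4.6, NA, 5.0),
  servings   = c(2, 16, 12, 6)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Rows lost by dropping every incomplete row
rows_dropped <- nrow(toy) - nrow(drop_na(toy))
rows_dropped
```

If only some columns matter for a given question, `drop_na(toy, calories)` restricts the check to those columns and keeps more rows.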

Examining Data

summary(cuisines_modified_new)
     name             country              url               author         
 Length:1989        Length:1989        Length:1989        Length:1989       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 date_published       ingredients           calories           fat        
 Min.   :2009-02-09   Length:1989        Min.   :   3.0   Min.   :  0.00  
 1st Qu.:2022-11-08   Class :character   1st Qu.: 194.0   1st Qu.:  7.00  
 Median :2024-07-15   Mode  :character   Median : 318.0   Median : 15.00  
 Mean   :2023-11-08                      Mean   : 355.4   Mean   : 18.38  
 3rd Qu.:2024-12-19                      3rd Qu.: 475.0   3rd Qu.: 25.00  
 Max.   :2025-07-29                      Max.   :2010.0   Max.   :151.00  
     carbs           protein         avg_rating   total_ratings   
 Min.   :  1.00   Min.   :  0.00   Min.   :1.00   Min.   :  1.00  
 1st Qu.: 13.00   1st Qu.:  4.00   1st Qu.:4.30   1st Qu.:  6.00  
 Median : 26.00   Median : 12.00   Median :4.60   Median : 24.00  
 Mean   : 31.62   Mean   : 16.49   Mean   :4.51   Mean   : 87.23  
 3rd Qu.: 45.00   3rd Qu.: 25.00   3rd Qu.:4.80   3rd Qu.: 90.00  
 Max.   :264.00   Max.   :135.00   Max.   :5.00   Max.   :997.00  
    reviews         prep_time         cook_time        total_time   
 Min.   :  1.00   Min.   :   0.00   Min.   :  0.00   Min.   :    0  
 1st Qu.:  6.00   1st Qu.:  10.00   1st Qu.: 10.00   1st Qu.:   35  
 Median : 21.00   Median :  15.00   Median : 25.00   Median :   60  
 Mean   : 78.73   Mean   :  21.18   Mean   : 42.96   Mean   :  176  
 3rd Qu.: 76.00   3rd Qu.:  25.00   3rd Qu.: 45.00   3rd Qu.:  120  
 Max.   :975.00   Max.   :1800.00   Max.   :600.00   Max.   :14440  
    servings     
 Min.   :  1.00  
 1st Qu.:  4.00  
 Median :  8.00  
 Mean   : 10.29  
 3rd Qu.: 12.00  
 Max.   :200.00  
dim(cuisines_modified_new)
[1] 1989   17
skim(cuisines_modified_new)
Data summary
Name cuisines_modified_new
Number of rows 1989
Number of columns 17
_______________________
Column type frequency:
character 5
Date 1
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 4 87 0 1987 0
country 0 1 4 28 0 49 0
url 0 1 45 110 0 1989 0
author 0 1 1 35 0 1507 0
ingredients 0 1 29 1081 0 1989 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date_published 0 1 2009-02-09 2025-07-29 2024-07-15 696

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
calories 0 1 355.36 224.49 3 194.0 318.0 475.0 2010 ▇▅▁▁▁
fat 0 1 18.38 15.65 0 7.0 15.0 25.0 151 ▇▂▁▁▁
carbs 0 1 31.62 25.55 1 13.0 26.0 45.0 264 ▇▂▁▁▁
protein 0 1 16.49 15.61 0 4.0 12.0 25.0 135 ▇▂▁▁▁
avg_rating 0 1 4.51 0.39 1 4.3 4.6 4.8 5 ▁▁▁▂▇
total_ratings 0 1 87.23 150.36 1 6.0 24.0 90.0 997 ▇▁▁▁▁
reviews 0 1 78.73 144.62 1 6.0 21.0 76.0 975 ▇▁▁▁▁
prep_time 0 1 21.18 54.60 0 10.0 15.0 25.0 1800 ▇▁▁▁▁
cook_time 0 1 42.96 64.70 0 10.0 25.0 45.0 600 ▇▁▁▁▁
total_time 0 1 175.99 673.15 0 35.0 60.0 120.0 14440 ▇▁▁▁▁
servings 0 1 10.29 12.42 1 4.0 8.0 12.0 200 ▇▁▁▁▁
names(cuisines_modified_new)
 [1] "name"           "country"        "url"            "author"        
 [5] "date_published" "ingredients"    "calories"       "fat"           
 [9] "carbs"          "protein"        "avg_rating"     "total_ratings" 
[13] "reviews"        "prep_time"      "cook_time"      "total_time"    
[17] "servings"      

Munging

cuisines_factor <- cuisines_modified_new %>%
  mutate(across(where(is.character), as.factor))%>%
  dplyr::relocate(where(is.factor), .before = name)
glimpse(cuisines_factor)
Rows: 1,989
Columns: 17
$ name           <fct> "Saganaki (Flaming Greek Cheese)", "Coney Island Knishe…
$ country        <fct> Greek, Jewish, Australian and New Zealander, Chilean, T…
$ url            <fct> https://www.allrecipes.com/recipe/263750/flaming-greek-…
$ author         <fct> "John Mitzewich", "John Mitzewich", "CHIPPENDALE", "Hei…
$ ingredients    <fct> "1 (4 ounce) package kasseri cheese, 1 tablespoon water…
$ date_published <date> 2024-02-07, 2024-11-26, 2022-07-14, 2025-01-31, 2025-0…
$ calories       <dbl> 391, 301, 64, 106, 449, 958, 378, 157, 322, 4, 389, 253…
$ fat            <dbl> 25, 17, 3, 9, 23, 24, 10, 6, 16, 0, 21, 2, 8, 31, 19, 7…
$ carbs          <dbl> 15, 31, 9, 7, 58, 144, 59, 25, 39, 1, 16, 63, 53, 43, 4…
$ protein        <dbl> 16, 7, 1, 1, 7, 46, 14, 2, 7, 0, 28, 6, 17, 25, 7, 29, …
$ avg_rating     <dbl> 4.8, 4.6, 4.3, 5.0, 3.8, 4.4, 4.3, 4.6, 5.0, 4.7, 4.4, …
$ total_ratings  <dbl> 25, 10, 126, 1, 13, 40, 3, 65, 2, 182, 19, 16, 20, 43, …
$ reviews        <dbl> 22, 9, 104, 1, 11, 32, 3, 55, 2, 138, 15, 16, 15, 39, 2…
$ prep_time      <dbl> 10, 30, 20, 10, 30, 30, 30, 0, 5, 5, 10, 10, 15, 60, 10…
$ cook_time      <dbl> 5, 75, 15, 0, 15, 165, 75, 0, 5, 0, 10, 50, 15, 10, 45,…
$ total_time     <dbl> 15, 180, 180, 10, 45, 675, 585, 0, 10, 5, 50, 300, 60, …
$ servings       <dbl> 2, 16, 12, 6, 15, 6, 6, 24, 1, 21, 4, 10, 8, 6, 10, 6, …

Summaries: Examining the Data

cuisines_factor %>%
  group_by(country) %>%
  summarise(
    count = n()
  ) %>%
  arrange(desc(count)) %>% 
  datatable()  # the piped table is already the data argument
cuisines_factor %>%
  group_by(country) %>%
  summarise(
    avg_calories = mean(calories, na.rm = TRUE),
  ) %>%
  arrange(desc(avg_calories)) %>% 
  datatable()
cuisines_factor %>%
  group_by(country) %>%
  summarise(
    avg_fat = mean(fat, na.rm = TRUE),
  ) %>%
  arrange(desc(avg_fat)) %>% 
  datatable()
cuisines_factor %>%
  group_by(country) %>%
  summarise(
    avg_carbs = mean(carbs, na.rm = TRUE),
  ) %>%
  arrange(desc(avg_carbs)) %>% 
  datatable()
cuisines_factor %>%
  group_by(country) %>%
  summarise(
    avg_protein = mean(protein, na.rm = TRUE),
  ) %>%
  arrange(desc(avg_protein)) %>% 
  datatable()
cuisines_factor %>%
  group_by(country, name) %>%
  summarise(
    avg_calories = mean(calories, na.rm = TRUE),
    avg_fat = mean(fat, na.rm = TRUE),
    avg_carbs = mean(carbs, na.rm = TRUE),
    avg_protein = mean(protein, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_calories)) %>% 
  datatable()
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
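
The message above appears because the data are grouped by two variables and summarise() only removes the last grouping level. A sketch of stating the intended grouping explicitly, which also silences the message. The `toy` data frame and its values are made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical recipes and calorie values)
toy <- data.frame(
  country  = c("Greek", "Greek", "Thai"),
  name     = c("Saganaki", "Moussaka", "Pad Thai"),
  calories = c(391, 450, 520)
)

by_recipe <- toy %>%
  group_by(country, name) %>%
  # .groups = "drop" returns an ungrouped result and suppresses the message
  summarise(avg_calories = mean(calories, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(avg_calories))

by_recipe
```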
cuisines_factor %>%
  group_by(country, name) %>%
  summarise(
    avg_calories = mean(calories, na.rm = TRUE),
    avg_rating = mean(avg_rating, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_calories)) %>% 
  datatable()
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
cuisines_factor %>%
  group_by(country, name) %>%
  summarise(
    avg_calories = mean(calories, na.rm = TRUE),
    avg_fat = mean(fat, na.rm = TRUE),
    avg_carbs = mean(carbs, na.rm = TRUE),
    avg_protein = mean(protein, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_fat)) %>% 
  datatable()
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
cuisines_factor %>%
  group_by(country, name) %>%
  summarise(
    avg_calories = mean(calories, na.rm = TRUE),
    avg_fat = mean(fat, na.rm = TRUE),
    avg_carbs = mean(carbs, na.rm = TRUE),
    avg_protein = mean(protein, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_carbs)) %>% 
  datatable()
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
cuisines_factor %>%
  group_by(country, name) %>%
  summarise(
    avg_calories = mean(calories, na.rm = TRUE),
    avg_fat = mean(fat, na.rm = TRUE),
    avg_carbs = mean(carbs, na.rm = TRUE),
    avg_protein = mean(protein, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_protein)) %>% 
  datatable()
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
cuisines_factor %>%
  group_by(country) %>%
  summarise(
    avg_prep_time = mean(prep_time, na.rm = TRUE),
    avg_cook_time = mean(cook_time, na.rm = TRUE),
    avg_total_time = mean(total_time, na.rm = TRUE),
    avg_rating = mean(avg_rating, na.rm = TRUE),
    avg_total_ratings = mean(total_ratings, na.rm = TRUE),
    avg_reviews = mean(reviews, na.rm = TRUE),
    avg_servings = mean(servings, na.rm = TRUE),
  ) %>%
  arrange(desc(avg_total_time)) %>% 
  datatable()
cuisines_factor %>%
  group_by(country) %>%
  summarise(
    avg_prep_time = mean(prep_time, na.rm = TRUE),
    avg_cook_time = mean(cook_time, na.rm = TRUE),
    avg_total_time = mean(total_time, na.rm = TRUE),
    avg_rating = mean(avg_rating, na.rm = TRUE),
    avg_total_ratings = mean(total_ratings, na.rm = TRUE),
    avg_reviews = mean(reviews, na.rm = TRUE),
    avg_servings = mean(servings, na.rm = TRUE),
  ) %>%
  arrange(desc(avg_total_ratings)) %>% 
  datatable()
cuisines_factor %>%
  group_by(name, country) %>%
  summarise(
    avg_prep_time = mean(prep_time, na.rm = TRUE),
    avg_cook_time = mean(cook_time, na.rm = TRUE),
    avg_total_time = mean(total_time, na.rm = TRUE),
    avg_servings = mean(servings, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_total_time)) %>% 
  datatable()
`summarise()` has grouped output by 'name'. You can override using the
`.groups` argument.
  • For some dishes the prep time, cook time and total time are recorded as 0, which cannot be accurate.
  • For many recipes prep_time + cook_time does not add up to total_time, although the data dictionary defines total_time as their sum.

Filtering out 0 values in Prep, Cook & Total Time:

cuisines_clean <- cuisines_factor %>%
  filter(prep_time > 0, cook_time > 0, total_time > 0)
cuisines_clean %>%
  arrange(desc(total_time)) %>%
  select(name, country, prep_time, cook_time, total_time) %>% 
  datatable()
# Add prep + cook so it can be compared with the recorded total_time
cuisine_correct_time <- cuisines_clean %>%
  mutate(sum_time = prep_time + cook_time)
cuisine_correct_time %>% 
  select(name, country, prep_time, cook_time, total_time, sum_time) %>% 
  datatable()
# Recipes where the recorded total differs from prep + cook
cuisine_correct_time %>%
  filter(total_time != sum_time) %>% 
  select(name, country, prep_time, cook_time, total_time, sum_time) %>% 
  datatable()

According to the data dictionary, cook_time does not account for fermentation, marination or other waiting periods, which are common in dishes such as breads and marinated chicken.

  • prep_time = before heat (chopping, mixing)
  • cook_time = with heat (boil, bake, fry etc)
  • total_time = prep + cook (but not fermentation/marination)

We also observed that total_time often does not match the sum of prep_time and cook_time, even though the definition above excludes processes like fermentation or marination. This suggests either inconsistencies in the dataset (missing or incorrectly recorded values in the original source) or that total_time includes waiting time that the data dictionary does not describe.
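
To size the mismatch rather than only list the affected rows, the gap can be computed directly. The sketch below uses the prep, cook and total times of the first three recipes from the glimpse() output earlier in this document. `extra_time` is a column name introduced here for illustration.

```r
library(dplyr)

# Times (minutes) of the first three recipes shown by glimpse()
toy <- data.frame(
  prep_time  = c(10, 30, 20),
  cook_time  = c(5, 75, 15),
  total_time = c(15, 180, 180)
)

gaps <- toy %>%
  mutate(
    sum_time   = prep_time + cook_time,
    extra_time = total_time - sum_time  # minutes not explained by prep + cook
  )

gaps

# Share of recipes whose total_time differs from prep_time + cook_time
mean(gaps$extra_time != 0)
```

Consistently positive gaps would be compatible with unlisted waiting steps, while negative gaps would point more clearly to recording errors.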

cuisines_factor %>%
  group_by(country) %>%
  summarise(avg_rating = mean(avg_rating, na.rm = TRUE),
            total_reviews = sum(reviews, na.rm = TRUE)) %>%
  arrange(desc(avg_rating)) %>% 
  datatable()
cuisines_factor %>%
  group_by(name) %>%
  summarise(max_rating = max(avg_rating, na.rm = TRUE),
            total_reviews = sum(reviews, na.rm = TRUE)) %>%
  arrange(desc(max_rating)) %>% 
  datatable()
cuisines_factor %>%
  group_by(name, servings) %>%
  summarise(prep_time_avg = mean(prep_time)) %>%
  arrange(desc(servings)) %>% 
  datatable()
`summarise()` has grouped output by 'name'. You can override using the
`.groups` argument.
cuisines_continent <- cuisines_factor %>%
  mutate(
    continent = case_when(
      country %in% c("Tex-Mex", "Amish and Mennonite", "Southern Recipes", 
                     "Cajun and Creole", "Canadian", "Soul Food") ~ "North America",
      
      country %in% c("Greek", "Polish", "Danish", "Belgian", "Spanish", "Portuguese", 
                     "Norwegian", "Austrian", "Swiss", "Russian", "Dutch", "German", 
                     "Swedish", "Finnish", "French", "Italian", "Scandinavian") ~ "Europe",
      
      country %in% c("Peruvian", "Argentinian", "Colombian", 
                     "Brazilian", "Chilean") ~ "South America",
      
      country %in% c("Vietnamese", "Japanese", "Israeli", "Thai", "Chinese", "Turkish", 
                     "Korean", "Lebanese", "Jewish", "Malaysian", "Bangladeshi", 
                     "Persian", "Indonesian", "Indian", "Pakistani", "Filipino") ~ "Asia",
      
      country %in% c("Australian and New Zealander") ~ "Oceania",
      
      country %in% c("Puerto Rican", "Jamaican", "South African", "Cuban") ~ "Caribbean/Africa",
      
      TRUE ~ "Other"
    )
  )

cuisines_continent %>% 
  select(name, country, continent) %>% 
  datatable()
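
Because case_when() sends anything unmatched to "Other", a misspelt or forgotten label fails silently. A quick check is to list the labels that ended up in the fallback. This is sketched below on made-up data with a shortened mapping; "Hawaiian" is a hypothetical label used only to trigger the fallback.

```r
library(dplyr)

# Toy data (hypothetical): "Hawaiian" is not covered by any rule below
toy <- data.frame(country = c("Greek", "Thai", "Cuban", "Hawaiian"))

toy_continent <- toy %>%
  mutate(
    continent = case_when(
      country %in% c("Greek") ~ "Europe",
      country %in% c("Thai")  ~ "Asia",
      country %in% c("Cuban") ~ "Caribbean/Africa",
      TRUE                    ~ "Other"
    )
  )

# Labels that were not covered by any rule
unmatched <- toy_continent %>%
  filter(continent == "Other") %>%
  distinct(country)

unmatched
```

An empty result means every label was assigned to a group on purpose.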

Visualising Information

Total number of recipes per country on the Allrecipes website

top_dishes <- cuisines_factor %>%
  group_by(country) %>%
  summarise(count = n()) %>%
  slice_max(n = 15, order_by = count) %>%    
  arrange(desc(count))
top_dishes %>% 
  gf_col(count~reorder(country,count), fill = "orchid4") %>%
  gf_labs(x = "Country", y = "Number of Dishes", title = "Top 15 Countries by Recipe Count") %>% 
  gf_theme(axis.text.x = element_text(angle = 45, hjust = 1))

  • Russian, Filipino, Chinese and Canadian cuisines have the most recipes in the dataset (63 each).
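
Since several cuisines are tied on 63 recipes, it is worth knowing that slice_max() keeps ties by default and can therefore return more rows than `n`. A sketch on made-up counts (the value for "Greek" is hypothetical):

```r
library(dplyr)

# Toy counts (hypothetical): three cuisines tied at the top
toy_counts <- data.frame(
  country = c("Russian", "Filipino", "Chinese", "Greek"),
  count   = c(63, 63, 63, 40)
)

# Default with_ties = TRUE: all three tied rows are kept although n = 2
with_ties <- toy_counts %>% slice_max(order_by = count, n = 2)
nrow(with_ties)

# with_ties = FALSE returns exactly n rows
exact <- toy_counts %>% slice_max(order_by = count, n = 2, with_ties = FALSE)
nrow(exact)
```

For a chart titled "Top N", `with_ties = FALSE` guarantees exactly N bars, at the cost of cutting arbitrarily among tied cuisines.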

Total number of recipes per continent on the Allrecipes website

top_continents <- cuisines_continent %>%
  group_by(continent) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

top_continents %>% 
  gf_col(count ~ reorder(continent, count), fill = "blue4") %>%
  gf_labs(
    x = "Continent",
    y = "Number of Dishes",
    title = "Number of Dishes by Continent"
  ) %>%
  gf_theme(axis.text.x = element_text(angle = 45, hjust = 1))

  • Asian recipes are the most common in this Allrecipes dataset, whereas Oceanian recipes are the least common.

1. Calorie content by country

top_10_countries_calorie <- cuisines_factor %>%
  group_by(country) %>%
  summarise(
    avg_calories = mean(calories, na.rm = TRUE),
  ) %>%
  slice_max(n=10, order_by = avg_calories) %>% 
  slice_head(n = 10)
top_10_countries_calorie %>%
  gf_col(reorder(country, avg_calories)~avg_calories, fill = "steelblue") %>%
  gf_labs(
    title = "Top 10 Countries by Average Calories",
    x = "Average Calories",
    y = "Country"
  )

continent_calorie <- cuisines_continent %>%
  group_by(continent) %>%
  summarise(
    avg_calories = mean(calories, na.rm = TRUE),
  ) %>%
  arrange(desc(avg_calories))  # only a handful of continent groups, so keep them all
continent_calorie %>%
  gf_col(reorder(continent, avg_calories)~avg_calories, fill = "steelblue2") %>%
  gf_labs(
    title = "Average Calories by Continent",
    x = "Average Calories",
    y = "Continent"
  )

2. Protein content by country

top_10_countries_protein <- cuisines_factor %>%
  group_by(country) %>%
  summarise(
    avg_protein = mean(protein, na.rm = TRUE),
  ) %>%
  slice_max(n=10, order_by = avg_protein)
top_10_countries_protein %>%
  gf_col(reorder(country, avg_protein)~avg_protein, fill = "darkorchid3") %>%
  gf_labs(
    title = "Top 10 Countries by Average Protein",
    x = "Average Protein",
    y = "Country"
  )

continent_protein <- cuisines_continent %>%
  group_by(continent) %>%
  summarise(
    avg_protein = mean(protein, na.rm = TRUE),
  ) %>%
  arrange(desc(avg_protein))
continent_protein %>%
  gf_col(reorder(continent, avg_protein)~avg_protein, fill = "darkorchid4") %>%
  gf_labs(
    title = "Average Protein by Continent",
    x = "Average Protein",
    y = "Continent"
  )

3. Carbs content by country

top_20_countries_carbs <- cuisines_factor %>%
  group_by(country) %>%
  summarise(
    avg_carbs = mean(carbs, na.rm = TRUE),
  ) %>%
  slice_max(n=20, order_by = avg_carbs)
top_20_countries_carbs %>%
  gf_col(reorder(country, avg_carbs)~avg_carbs, fill = "rosybrown") %>%
  gf_labs(
    title = "Top 20 Countries by Average Carbs",
    x = "Average Carbs",
    y = "Country"
  )

continent_carbs <- cuisines_continent %>%
  group_by(continent) %>%
  summarise(
    avg_carbs = mean(carbs, na.rm = TRUE),
  ) %>%
  arrange(desc(avg_carbs))
continent_carbs %>%
  gf_col(reorder(continent, avg_carbs)~avg_carbs, fill = "rosybrown4") %>%
  gf_labs(
    title = "Average Carbs by Continent",
    x = "Average Carbs",
    y = "Continent"
  )

4. Fat content by country

top_10_countries_fat <- cuisines_factor %>%
  group_by(country) %>%
  summarise(
    avg_fat = mean(fat, na.rm = TRUE),
  ) %>%
  slice_max(n=10, order_by = avg_fat)
top_10_countries_fat %>% 
  gf_col(reorder(country, avg_fat)~avg_fat, fill = "royalblue3") %>%
  gf_labs(
    title = "Top 10 Countries by Average Fat",
    x = "Average Fat",
    y = "Country"
  )

continent_fat <- cuisines_continent %>%
  group_by(continent) %>%
  summarise(
    avg_fat = mean(fat, na.rm = TRUE),
  ) %>%
  arrange(desc(avg_fat))
continent_fat %>% 
  gf_col(reorder(continent, avg_fat)~avg_fat, fill = "royalblue4") %>%
  gf_labs(
    title = "Average Fat by Continent",
    x = "Average Fat",
    y = "Continent"
  )

Case Study: Italian Food

After viewing the carbs, protein and fat content by country, Italian food appears to have a higher average nutrient content than most other cuisines.

1. Calories Comparison

cuisines_factor %>%
  mutate(is_italian = ifelse(country == "Italian","Italian", "Other")) %>%
  gf_boxplot(calories ~ is_italian, fill = c("paleturquoise3", "paleturquoise4"), orientation = "x") %>%
  gf_labs(
    title = "Calories: Italian vs. Other Cuisines",
    x = "Cuisine Category",
    y = "Calories"
  )

2. Protein Comparison

cuisines_factor %>%
  mutate(is_italian = ifelse(country == "Italian","Italian", "Other")) %>%
  gf_boxplot(protein ~ is_italian, fill = c("palegreen3", "palegreen4"), orientation = "x") %>%
  gf_labs(
    title = "Protein: Italian vs. Other Cuisines",
    x = "Cuisine Category",
    y = "Protein"
  )

3. Fat Comparison

cuisines_factor %>%
  mutate(is_italian = ifelse(country == "Italian","Italian", "Other")) %>%
  gf_boxplot(fat ~ is_italian, fill = c("royalblue", "royalblue4"), orientation = "x") %>%
  gf_labs(
    title = "Fat Content: Italian vs. Other Cuisines",
    x = "Cuisine Category",
    y = "Fat"
  )
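
The boxplots compare the two groups visually. The medians they display could also be tabulated, as in this sketch on made-up values (the `toy` data frame and all of its numbers are hypothetical):

```r
library(dplyr)

# Toy data (hypothetical calorie and fat values)
toy <- data.frame(
  country  = c("Italian", "Italian", "Italian", "Greek", "Thai", "Thai"),
  calories = c(400, 520, 610, 300, 350, 450),
  fat      = c(20, 28, 35, 12, 15, 22)
)

medians <- toy %>%
  mutate(is_italian = ifelse(country == "Italian", "Italian", "Other")) %>%
  group_by(is_italian) %>%
  summarise(
    n               = n(),               # group size, to judge how reliable the median is
    median_calories = median(calories),
    median_fat      = median(fat)
  )

medians
```

Reporting the group size next to each median matters here because the "Other" group pools many cuisines and is much larger than the Italian group.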

We see a higher median for Italian food across these nutrients. Does having more calories, fat and protein make it unhealthy?

Italian cooking is deeply rooted in the Mediterranean diet, which is widely recognized by nutritionists and health experts for its numerous benefits. A common misconception is that a high calorie, protein or fat content makes food unhealthy. However, not all high-calorie or high-fat foods are unhealthy. The Italian diet relies on whole grains, fruits, vegetables and legumes, with olive oil as the primary source of fat. Many traditional Italian dishes therefore naturally incorporate these healthy elements.

  • Olive Oil: Rich in monounsaturated fats, known for their heart-healthy properties.
  • Tomatoes: Packed with antioxidants like lycopene.
  • Garlic and Onions: Offer various health benefits and add flavor without excessive calories.

Correlation between Variables

GGally::ggpairs(
  cuisines_factor %>% drop_na(),
  columns = c(
    "fat", "calories", "protein", "servings"
  ),
  switch = "both",
  progress = FALSE,

  diag = list(continuous = "barDiag"),
  lower = list(continuous = wrap("smooth", alpha = 0.3, se = FALSE)),

  title = "Cuisine Data Correlations Plot"
)
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

  • Fat and protein have a high positive correlation.
  • Calories and fat have a high positive correlation, which suggests that fat is a major contributor to calories.
  • Calories and protein have a high positive correlation, which suggests that protein is also a major contributor to calories.
  • The number of servings has only a weak correlation with calories, fat and protein.
GGally::ggpairs(
  cuisine_correct_time %>% drop_na(),
  columns = c(
    "servings", "sum_time", "cook_time", "prep_time", "total_ratings","reviews","avg_rating"
  ),
  switch = "both",
  progress = FALSE,

  diag = list(continuous = "barDiag"),
  lower = list(continuous = wrap("smooth", alpha = 0.3, se = FALSE)),

  title = "Cuisine Data Correlations Plot"
)
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

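  • From the above plot, reviews and total_ratings have a high positive correlation: recipes that are rated more often also receive more written reviews. Since total_ratings counts ratings rather than measuring their value, this does not show that more reviews lead to a higher rating.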

Hypothesis: Dishes with more servings take more total time.

cuisine_correct_time %>%
  gf_point(sum_time ~ servings, alpha = 0.5, color = "darkblue") %>%
  gf_labs(
    title = "Total Time vs. Number of Servings",
    x = "Servings",
    y = "Total Time"
  ) %>%
  gf_lm(color="black")

  • The trend is weak: there is no strong correlation between the two variables.

Hypothesis: The more reviews a recipe has, the higher its number of ratings

cuisines_factor %>%
  gf_point(reviews~total_ratings, alpha = 0.5, color = "darkviolet") %>%
  gf_labs(
    title = "Reviews vs. Total Ratings",
    x = "Total Ratings",
    y = "Reviews"
  ) %>%
  gf_lm(color = "black")

The scatterplot shows a strong positive relationship, which supports the hypothesis that total reviews and total ratings are positively correlated.

Hypothesis: Foods with more calories have higher total ratings.

cuisines_factor %>% 
  gf_point(total_ratings~calories, alpha = 0.5, color = "seagreen") %>% 
  gf_lm(color = "black") %>% 
  gf_labs(title ="Calories vs Total Ratings",
          x = "Calories",
          y ="Total Ratings")

The scatterplot shows no clear correlation between Calories and Total Ratings, so the data does not support the hypothesis.

Hypothesis: Recipes with a longer total time will have fewer total ratings, since they are harder to prepare.

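cuisine_correct_time %>% 
  # y ~ x: total ratings on the y-axis, total time on the x-axis, matching the labels
  gf_point(total_ratings ~ sum_time, alpha = 0.5, color = "royalblue") %>% 
  gf_labs(title = "Total Time vs Total Ratings",
          x = "Total Time",
          y = "Total Ratings") %>% 
  gf_lm(color = "black")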

The r value is -0.02, which indicates almost no linear relationship. A longer or shorter total time is therefore not a strong determining factor in how many ratings a recipe receives.
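
An r value like the one quoted above can be obtained with a Pearson correlation test. The sketch below is a minimal example on simulated data: the `toy` data frame is an assumption made for illustration, and it calls base R's `stats::cor.test` rather than the `mosaic` wrapper used elsewhere in this report.

```r
# Simulated stand-in data, for illustration only:
# total time and number of ratings are generated independently
set.seed(42)
toy <- data.frame(
  sum_time      = runif(300, min = 10, max = 300),
  total_ratings = rpois(300, lambda = 50)
)

# Pearson correlation test between total time and total ratings
res <- stats::cor.test(toy$sum_time, toy$total_ratings)

r  <- unname(res$estimate)  # correlation coefficient
ci <- res$conf.int          # 95% confidence interval for r

r
ci
```

If the confidence interval spans 0, the data gives no evidence of a linear relationship between the two variables.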

cuisines_factor %>%
  gf_point(calories ~ servings, alpha = 0.5, color = "darkred") %>%
  gf_labs(
    title = "Calories vs. Servings",
    x = "Servings",
    y = "Calories"
  ) %>%
  gf_lm(color = "black")

mosaic::cor_test(calories ~ servings, data = cuisines_factor) %>%
  broom::tidy() %>%
  knitr::kable(
    digits = 2,
    caption = "Calories vs Servings"
  )
Calories vs Servings
estimate statistic p.value parameter conf.low conf.high method alternative
-0.32 -15.31 0 1987 -0.36 -0.28 Pearson’s product-moment correlation two.sided
  • The correlation between Calories and Servings is negative but weak (r = -0.32).

Total Time vs Country

top_5_countries_time<- cuisine_correct_time %>%
  group_by(country) %>%
  summarise(
    avg_tt= mean(sum_time, na.rm = TRUE),
  )
# Show all countries sorted by average total time (longest first)
top_5_countries_time %>%
  arrange(desc(avg_tt)) %>%
  datatable()
top_5_countries_time<- cuisine_correct_time %>%
  group_by(country) %>%
  summarise(
    avg_tt= mean(sum_time, na.rm = TRUE),
  ) %>%
   slice_max(n=35, order_by = avg_tt)
top_5_countries_time %>% 
  gf_col(reorder(country, avg_tt)~avg_tt, fill = "seagreen")%>%
  gf_labs(
    title = "Top Countries by Average Time",
    x = "Average Time",
    y = "Country"
  )

Observations:

  1. Highest Total Time

    • Norwegian and Portuguese cuisines have the highest average total time.

    • Cuban, Soul Food, Polish, German, Jewish and Persian cuisines also have very long total times.

    • These cuisines likely rely heavily on slow cooking, roasting, etc.

  2. Moderate Total Time

    • Brazilian, Greek, South African, Belgian, Swiss, Spanish, Indonesian, Southern Recipes, Canadian, Malaysian, Pakistani and Peruvian cuisines have moderate total times.

  3. Quick Cooking Cuisines

    • Chinese, Colombian, Japanese, Thai, Swedish, Indian, Vietnamese and Dutch cuisines have the shortest total times.

    • These cuisines likely use fast methods such as stir-frying and steaming.
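
The three groups above were read off the bar chart. A data-driven alternative is to split the averages into tiers by quantile. The sketch below is one possible approach, not the method used in this report: the `avg_time` data frame holds made-up values, and the choice of tertile cut-offs is an assumption.

```r
# Made-up average total times per cuisine, for illustration only
avg_time <- data.frame(
  country = c("A", "B", "C", "D", "E", "F"),
  avg_tt  = c(240, 180, 95, 90, 45, 30),
  stringsAsFactors = FALSE
)

# Tertile cut-offs: minimum, 1/3 quantile, 2/3 quantile, maximum
breaks <- quantile(avg_time$avg_tt, probs = c(0, 1/3, 2/3, 1))

# Assign each cuisine to a tier; include.lowest keeps the minimum value in "Quick"
avg_time$tier <- cut(
  avg_time$avg_tt,
  breaks = breaks,
  labels = c("Quick", "Moderate", "Long"),
  include.lowest = TRUE
)

avg_time
```

With real data, the same steps could be applied to the per-country summary to make the tier boundaries explicit and reproducible.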

Dish Rating

top_10_dishes_rating <- cuisines_factor %>%
  group_by(name) %>%
  summarise(
    max_total_rating = max(total_ratings, na.rm = TRUE),
    max_reviews = max(reviews, na.rm = TRUE)
  ) %>%

  arrange(desc(max_total_rating)) %>% 
  slice_head(n=20)


top_10_dishes_rating %>%
  gf_col(reorder(name, max_total_rating) ~ max_total_rating, fill = "indianred") %>%
  gf_labs(
    title = "Top 20 Most Rated Recipes",
    x = "Total Ratings",
    y = "Dish Name"
  )

# Same 20 recipes (selected by total ratings), now ordered by number of reviews
top_10_dishes_rating %>% 
  gf_col(reorder(name, max_reviews) ~ max_reviews, fill = "indianred") %>%
  gf_labs(
    title = "Reviews for the Top 20 Most Rated Recipes",
    x = "Max Reviews",
    y = "Dish Name"
  )

As the earlier correlation plot showed, Reviews and Total Ratings have a high positive correlation.

The top three recipes are the same in both charts:

  • Cheesy Amish Breakfast Casserole

  • Blender Hollandaise Sauce

  • Boneless Buffalo Wings

The similarity in the rankings suggests that a high number of reviews goes hand in hand with a high number of total ratings. A good recipe attracts many reviews, which also raises its Total Ratings, whereas a poor recipe is unlikely to sustain either a high review volume or a high total rating.
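
To compare the two rankings directly, each metric can be ranked on its own and the overlap checked. The sketch below uses a made-up data frame (`toy`) and a hypothetical helper `top_n_names`. It is an illustration of the idea, not the code that produced the charts above.

```r
# Made-up recipes, for illustration only
toy <- data.frame(
  name          = c("r1", "r2", "r3", "r4", "r5", "r6"),
  total_ratings = c(900, 850, 400, 300, 120, 60),
  reviews       = c(700, 640, 90, 310, 20, 95),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of the n rows with the largest value in `column`
top_n_names <- function(df, column, n) {
  df$name[order(df[[column]], decreasing = TRUE)][seq_len(n)]
}

# Rank by each metric independently
top_by_ratings <- top_n_names(toy, "total_ratings", 3)
top_by_reviews <- top_n_names(toy, "reviews", 3)

# Recipes that appear in both top-3 lists
overlap <- intersect(top_by_ratings, top_by_reviews)
overlap
```

A large overlap between the two lists is what a strong Reviews and Total Ratings correlation would predict.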

Conclusion & Inferences

In conclusion, the analysis of cuisine recipes on the Allrecipes website shows that:

1. Some cuisines, like Italian, have more Calories, Fat and Protein than average. This doesn’t mean they are unhealthy, as the ingredients used matter.

2. If a recipe is high in calories, it’s usually high in Fat and Protein too. Fat and Protein contribute heavily to calories.

3. Recipes from places like Norway and Portugal take the longest to make, possibly because they rely on slow cooking, whereas recipes from China and Japan are usually among the fastest.

4. The number of reviews is strongly correlated with the total number of ratings a recipe receives.

5. Russian, Filipino, Chinese and Canadian cuisines have the highest number of recipes in the dataset (63 recipes each).

6. Recipes from North America have the highest calories, protein and fat, while recipes from Oceania have the highest carbs.

Overall, it was interesting to analyse and understand cuisine characteristics around the world and user preferences on the Allrecipes website.